Analyzing Happiness Around The World

Pauline Comising and Amanda Ma

Introduction

With the current state of the world, we wanted to focus on a more cheerful topic: happiness. More specifically, what contributes to our overall happiness, and how does that differ across different parts of the world? In this tutorial, we use some interesting datasets to help us answer questions like these.

We will be guiding you through the data science pipeline, walking through three main sections:

  1. Data curation, parsing, and management
  2. Exploratory Data Analysis
  3. Prediction & Machine Learning

Through these steps, we will process, analyze, and draw conclusions from our data, hopefully bringing insight into what people across the globe view as important to their happiness. There are also plenty of additional resources available if you would like to learn more about the data science pipeline.

Our Questions

Through our analysis, we strive to answer the following questions:

  1. What regions are the happiest?
  2. What do countries find most important to be happy?
  3. How do real factors (like GDP and life expectancy) affect happiness and what people consider most important?
  4. Can weighting certain aspects as less important lead to more happiness, despite harsh realities such as low life expectancy?

And we look to data science for these problems as a way to:

  1. Take three different datasets and tell a story that goes beyond each of them individually (specifically question 4)
  2. Reveal a variety of insights across all of these questions by selecting and filtering data from our merged dataset
  3. Verify hypotheses and make predictions through machine learning, specifically linear regression
  4. And, in general: make sense of data that is otherwise scattered and hard to understand

Data Curation, Parsing & Management

First, let's import the libraries we will be using in this section:

In [1]:
import pandas as pd

Next, we will load our datasets. Our primary dataset (2019.csv) displays countries, their happiness score out of 10, and various factors that contribute to their overall happiness score. We also have two additional datasets (countries of the world.csv and life-expectancy.csv) that contain more information including actual GDP per capita and actual life expectancy for each country. We will be merging the data together.

In [2]:
data = pd.read_csv('2019.csv')
data
Out[2]:
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298
... ... ... ... ... ... ... ... ... ...
151 152 Rwanda 3.334 0.359 0.711 0.614 0.555 0.217 0.411
152 153 Tanzania 3.231 0.476 0.885 0.499 0.417 0.276 0.147
153 154 Afghanistan 3.203 0.350 0.517 0.361 0.000 0.158 0.025
154 155 Central African Republic 3.083 0.026 0.000 0.105 0.225 0.235 0.035
155 156 South Sudan 2.853 0.306 0.575 0.295 0.010 0.202 0.091

156 rows × 9 columns

In [3]:
gdp = pd.read_csv('countries of the world.csv')
gdp
Out[3]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48,0 0,00 23,06 163,07 700.0 36,0 3,2 12,13 0,22 87,65 1 46,6 20,34 0,38 0,24 0,38
1 Albania EASTERN EUROPE 3581655 28748 124,6 1,26 -4,93 21,52 4500.0 86,5 71,2 21,09 4,42 74,49 3 15,11 5,22 0,232 0,188 0,579
2 Algeria NORTHERN AFRICA 32930091 2381740 13,8 0,04 -0,39 31 6000.0 70,0 78,1 3,22 0,25 96,53 1 17,14 4,61 0,101 0,6 0,298
3 American Samoa OCEANIA 57794 199 290,4 58,29 -20,71 9,27 8000.0 97,0 259,5 10 15 75 2 22,46 3,27 NaN NaN NaN
4 Andorra WESTERN EUROPE 71201 468 152,1 0,00 6,6 4,05 19000.0 100,0 497,2 2,22 0 97,78 3 8,71 6,25 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
222 West Bank NEAR EAST 2460492 5860 419,9 0,00 2,98 19,62 800.0 NaN 145,2 16,9 18,97 64,13 3 31,67 3,92 0,09 0,28 0,63
223 Western Sahara NORTHERN AFRICA 273008 266000 1,0 0,42 NaN NaN NaN NaN NaN 0,02 0 99,98 1 NaN NaN NaN NaN 0,4
224 Yemen NEAR EAST 21456188 527970 40,6 0,36 0 61,5 800.0 50,2 37,2 2,78 0,24 96,98 1 42,89 8,3 0,135 0,472 0,393
225 Zambia SUB-SAHARAN AFRICA 11502010 752614 15,3 0,00 0 88,29 800.0 80,6 8,2 7,08 0,03 92,9 2 41 19,93 0,22 0,29 0,489
226 Zimbabwe SUB-SAHARAN AFRICA 12236805 390580 31,3 0,00 0 67,69 1900.0 90,7 26,8 8,32 0,34 91,34 2 28,01 21,84 0,179 0,243 0,579

227 rows × 20 columns

In [4]:
life_expec = pd.read_csv('life-expectancy.csv')
life_expec
Out[4]:
Entity Code Year Life expectancy (years)
0 Afghanistan AFG 1950 27.638
1 Afghanistan AFG 1951 27.878
2 Afghanistan AFG 1952 28.361
3 Afghanistan AFG 1953 28.852
4 Afghanistan AFG 1954 29.350
... ... ... ... ...
19023 Zimbabwe ZWE 2015 59.534
19024 Zimbabwe ZWE 2016 60.294
19025 Zimbabwe ZWE 2017 60.812
19026 Zimbabwe ZWE 2018 61.195
19027 Zimbabwe ZWE 2019 61.490

19028 rows × 4 columns

For our two supplementary datasets, we will only select the data that we will use in our further analysis and visualizations. More specifically, we are keeping the country name, region, GDP per capita, and life expectancy.

In [5]:
gdp = gdp[['Country', 'Region', 'GDP ($ per capita)']]
gdp
Out[5]:
Country Region GDP ($ per capita)
0 Afghanistan ASIA (EX. NEAR EAST) 700.0
1 Albania EASTERN EUROPE 4500.0
2 Algeria NORTHERN AFRICA 6000.0
3 American Samoa OCEANIA 8000.0
4 Andorra WESTERN EUROPE 19000.0
... ... ... ...
222 West Bank NEAR EAST 800.0
223 Western Sahara NORTHERN AFRICA NaN
224 Yemen NEAR EAST 800.0
225 Zambia SUB-SAHARAN AFRICA 800.0
226 Zimbabwe SUB-SAHARAN AFRICA 1900.0

227 rows × 3 columns

Since the life expectancy dataset contains multiple years of data for each country, we will narrow it down to the year 2019, as we want the most up-to-date analysis.

In [6]:
life_expec = (life_expec[life_expec['Year'] == 2019][['Entity', 'Life expectancy (years)']]
              .rename(columns={'Entity': 'Country'})
              .reset_index(drop=True))
life_expec 
Out[6]:
Country Life expectancy (years)
0 Afghanistan 64.833
1 Africa 63.170
2 Albania 78.573
3 Algeria 76.880
4 American Samoa 73.745
... ... ...
238 Western Sahara 70.263
239 World 72.584
240 Yemen 66.125
241 Zambia 63.886
242 Zimbabwe 61.490

243 rows × 2 columns

From closer inspection, we realized that there was an extra space after each country name in the countries of the world dataset. We will get rid of it so we can match the Country columns and prepare to merge the datasets.

This brings up an important point: as a data scientist, you should know your data quite well, including how it is formatted. Knowing these details makes it easier to clean and manage data, and helps with visualizations and further analysis.
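To see why that single trailing space matters, here is a quick illustration with made-up rows (the values below are invented for demonstration): identical-looking keys that differ only by whitespace will not match in a merge.

```python
import pandas as pd

# Toy frames for illustration only -- note the trailing space in 'Norway '
scores = pd.DataFrame({'Country': ['Norway', 'Finland'], 'Score': [7.554, 7.769]})
gdps = pd.DataFrame({'Country': ['Norway ', 'Finland'], 'GDP': [37800, 27400]})

# 'Norway' != 'Norway ', so the Norway row silently disappears
naive = scores.merge(gdps, on='Country', how='inner')

# Stripping the whitespace restores the match
gdps['Country'] = gdps['Country'].str.strip()
fixed = scores.merge(gdps, on='Country', how='inner')

print(len(naive), len(fixed))  # 1 2
```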

In [7]:
# Strip the trailing whitespace from each country name. Assigning with
# .assign (rather than writing into a slice of the original frame) avoids
# pandas' SettingWithCopyWarning, which a loop-and-reassign approach triggers.
gdp = gdp.assign(Country=gdp['Country'].str.strip())

Now, we will join the datasets on country name. We are passing how='inner', meaning that when we merge the data, we will only keep the countries (and their corresponding data) that exist in all three datasets.
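To make the join behavior concrete, here is a small sketch (with one fictional country on each side) of how how='inner' differs from how='outer':

```python
import pandas as pd

# Tiny example frames; 'Narnia' exists only on the left, 'Oz' only on the right
left = pd.DataFrame({'Country': ['Finland', 'Narnia'], 'Score': [7.769, 6.0]})
right = pd.DataFrame({'Country': ['Finland', 'Oz'], 'GDP': [27400, 9000]})

inner = left.merge(right, on='Country', how='inner')  # only keys present in both
outer = left.merge(right, on='Country', how='outer')  # union of keys, NaN-filled

print(inner['Country'].tolist())  # ['Finland']
print(len(outer))                 # 3
```

An inner join is the right choice for us, because a country that is missing its GDP or life expectancy data would be of little use in the later analysis.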

In [8]:
data = data.rename(columns={"Country or region": "Country"})
data = data.merge(gdp, on='Country', how='inner')

data = data.merge(life_expec, on='Country', how='inner')

data
Out[8]:
Overall rank Country Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Region GDP ($ per capita) Life expectancy (years)
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393 WESTERN EUROPE 27400.0 81.908
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410 WESTERN EUROPE 31100.0 80.898
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341 WESTERN EUROPE 37800.0 82.404
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118 WESTERN EUROPE 30900.0 82.993
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298 WESTERN EUROPE 28600.0 82.283
... ... ... ... ... ... ... ... ... ... ... ... ...
136 150 Malawi 3.410 0.191 0.560 0.495 0.443 0.218 0.089 SUB-SAHARAN AFRICA 600.0 64.263
137 151 Yemen 3.380 0.287 1.163 0.463 0.143 0.108 0.077 NEAR EAST 800.0 66.125
138 152 Rwanda 3.334 0.359 0.711 0.614 0.555 0.217 0.411 SUB-SAHARAN AFRICA 1300.0 69.024
139 153 Tanzania 3.231 0.476 0.885 0.499 0.417 0.276 0.147 SUB-SAHARAN AFRICA 600.0 65.456
140 154 Afghanistan 3.203 0.350 0.517 0.361 0.000 0.158 0.025 ASIA (EX. NEAR EAST) 700.0 64.833

141 rows × 12 columns

Now that we've combined our datasets, we see that some column titles are similar. To avoid confusion, let's clean them up and rename them for better clarity.

In [9]:
data = data.rename(columns={"Score": "Happiness Score", "GDP per capita": "GDP Importance"
                            , "Social support": "Social Support Importance"
                            , "Healthy life expectancy": "Healthy Life Expectancy Importance"
                            , "Freedom to make life choices": "Freedom Importance"
                            , "Generosity": "Generosity Importance"
                            , "Perceptions of corruption": "Absence of Corruption Importance"
                            , "GDP ($ per capita)": "Actual GDP"
                            , "Life expectancy (years)": "Actual Life Expectancy"})
data
Out[9]:
Overall rank Country Happiness Score GDP Importance Social Support Importance Healthy Life Expectancy Importance Freedom Importance Generosity Importance Absence of Corruption Importance Region Actual GDP Actual Life Expectancy
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393 WESTERN EUROPE 27400.0 81.908
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410 WESTERN EUROPE 31100.0 80.898
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341 WESTERN EUROPE 37800.0 82.404
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118 WESTERN EUROPE 30900.0 82.993
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298 WESTERN EUROPE 28600.0 82.283
... ... ... ... ... ... ... ... ... ... ... ... ...
136 150 Malawi 3.410 0.191 0.560 0.495 0.443 0.218 0.089 SUB-SAHARAN AFRICA 600.0 64.263
137 151 Yemen 3.380 0.287 1.163 0.463 0.143 0.108 0.077 NEAR EAST 800.0 66.125
138 152 Rwanda 3.334 0.359 0.711 0.614 0.555 0.217 0.411 SUB-SAHARAN AFRICA 1300.0 69.024
139 153 Tanzania 3.231 0.476 0.885 0.499 0.417 0.276 0.147 SUB-SAHARAN AFRICA 600.0 65.456
140 154 Afghanistan 3.203 0.350 0.517 0.361 0.000 0.158 0.025 ASIA (EX. NEAR EAST) 700.0 64.833

141 rows × 12 columns

Exploratory Data Analysis

Now that we have loaded and tidied our data, we will move on to creating visualizations, and drawing conclusions from them. We will be using different visualization libraries, so we can show you different tools and ways you can display your data!

For our visualizations, we will be using Seaborn, Plotly, Matplotlib, and NumPy.

In [10]:
import seaborn as sns
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import matplotlib.pyplot as plt
import numpy as np

To get a better idea of which parts of the world have greater overall happiness, we will create boxplots using Seaborn, comparing different regions' happiness scores.

In [11]:
ax = sns.boxplot(x = 'Region', y = 'Happiness Score', data = data)
ax.set_title("Happiness Across Regions Around the World")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

From this plot, we can get a clear idea of which regions are generally the happiest. We see that Western Europe, Oceania, and North America lead with the greatest happiness scores. The boxplots also show the variability among countries within each region. For example, the 'Near East' region has the greatest range, showing that there are countries with very high happiness scores, but also some with quite low scores.

Another way to display which countries/regions are happier is through a map, using Plotly. We will create a choropleth map, with a range of colors indicating the happiness score of each country listed in the dataset.

In [12]:
d = dict(type = 'choropleth', 
           locations = data['Country'],
           locationmode = 'country names',
           z = data['Happiness Score'], 
           text = data['Country'],
           colorbar = {'title':'Happiness'})
layout = dict(title = 'Happiness Across the World', 
             geo = dict(showframe = False))
cmap = go.Figure(data = [d], layout=layout)
iplot(cmap)

This map creates a great visual for analysis, as we can compare different regions' happiness, and specific countries as well. We can also hover over specific areas to see the country name and their happiness score, which may make this visualization even more useful and detailed than the first one. Looking at the map, we can see that North America, Europe, and Australia seem to be leading in the happiness ranking, which is similar to our initial conclusions from the boxplots. The variability in regions that we pointed out in the first plot can now be seen more clearly as well; for example, we can see how some Eastern and Asian countries have low happiness, while others are much happier.

Now that we've analyzed which countries and areas have the greatest happiness rankings, we can move on to the 'why?'. What contributes to their happiness? For closer analysis, we will separate the countries into 3 bins: countries with high, medium, and low happiness. We will compare which factors are more important to the countries in each category, and whether these priorities differ based on current happiness levels.
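Before running it on the real data, note how pd.cut behaves: it splits the value range into equal-width intervals, not equal-sized groups, so the three bins can hold very different numbers of countries. A small sketch with invented scores:

```python
import pandas as pd

scores = pd.Series([2.9, 4.0, 5.1, 6.2, 7.8])

# Three equal-width bins over the range [2.9, 7.8]; each spans about 1.63 points
levels = pd.cut(scores, 3, labels=['Low', 'Medium', 'High'])
print(levels.tolist())  # ['Low', 'Low', 'Medium', 'High', 'High']
```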

In [13]:
data['General Happiness Level'] = pd.cut(data["Happiness Score"], 3
                                   , labels=["Low", "Medium", "High"])
low_happiness =  data.loc[data['General Happiness Level'] == "Low"]
medium_happiness = data.loc[data['General Happiness Level'] == "Medium"]
high_happiness = data.loc[data['General Happiness Level'] == "High"]
data
Out[13]:
Overall rank Country Happiness Score GDP Importance Social Support Importance Healthy Life Expectancy Importance Freedom Importance Generosity Importance Absence of Corruption Importance Region Actual GDP Actual Life Expectancy General Happiness Level
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393 WESTERN EUROPE 27400.0 81.908 High
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410 WESTERN EUROPE 31100.0 80.898 High
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341 WESTERN EUROPE 37800.0 82.404 High
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118 WESTERN EUROPE 30900.0 82.993 High
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298 WESTERN EUROPE 28600.0 82.283 High
... ... ... ... ... ... ... ... ... ... ... ... ... ...
136 150 Malawi 3.410 0.191 0.560 0.495 0.443 0.218 0.089 SUB-SAHARAN AFRICA 600.0 64.263 Low
137 151 Yemen 3.380 0.287 1.163 0.463 0.143 0.108 0.077 NEAR EAST 800.0 66.125 Low
138 152 Rwanda 3.334 0.359 0.711 0.614 0.555 0.217 0.411 SUB-SAHARAN AFRICA 1300.0 69.024 Low
139 153 Tanzania 3.231 0.476 0.885 0.499 0.417 0.276 0.147 SUB-SAHARAN AFRICA 600.0 65.456 Low
140 154 Afghanistan 3.203 0.350 0.517 0.361 0.000 0.158 0.025 ASIA (EX. NEAR EAST) 700.0 64.833 Low

141 rows × 13 columns

Above, you can see we've added a column showing the general happiness level for each country (high, medium, or low).

In [14]:
# labels for x axis 
factors = ['GDP', 'Social Support', 'Healthy Life Expectancy', 'Freedom'
          , 'Generosity', 'Absence of Corruption']

Let's create boxplots with Matplotlib, comparing the importance of different attributes, based on happiness levels.

In [15]:
# boxplot comparing attributes for low happiness countries
val = [low_happiness['GDP Importance'].values.tolist()
       , low_happiness['Social Support Importance'].values.tolist()
       , low_happiness['Healthy Life Expectancy Importance'].values.tolist()
       , low_happiness['Freedom Importance'].values.tolist()
       , low_happiness['Generosity Importance'].values.tolist()
       , low_happiness['Absence of Corruption Importance'].values.tolist()]
plt.boxplot(val, patch_artist=True)
plt.xticks([1, 2, 3, 4, 5, 6], factors, rotation=90)
plt.xlabel('Factor Contributing to Happiness')
plt.title('Importance of Factors for Countries with Low Happiness Rankings')
plt.show()
In [16]:
# boxplot comparing attributes for medium happiness countries
val = [medium_happiness['GDP Importance'].values.tolist()
       , medium_happiness['Social Support Importance'].values.tolist()
       , medium_happiness['Healthy Life Expectancy Importance'].values.tolist()
       , medium_happiness['Freedom Importance'].values.tolist()
       , medium_happiness['Generosity Importance'].values.tolist()
       , medium_happiness['Absence of Corruption Importance'].values.tolist()]
plt.boxplot(val, patch_artist=True)
plt.xticks([1, 2, 3, 4, 5, 6], factors, rotation=90)
plt.xlabel('Factor Contributing to Happiness')
plt.title('Importance of Factors for Countries with Medium Happiness Rankings')
plt.show()
In [17]:
# boxplot comparing attributes for high happiness countries
val = [high_happiness['GDP Importance'].values.tolist()
       , high_happiness['Social Support Importance'].values.tolist()
       , high_happiness['Healthy Life Expectancy Importance'].values.tolist()
       , high_happiness['Freedom Importance'].values.tolist()
       , high_happiness['Generosity Importance'].values.tolist()
       , high_happiness['Absence of Corruption Importance'].values.tolist()]
plt.boxplot(val, patch_artist=True)
plt.xticks([1, 2, 3, 4, 5, 6], factors, rotation=90)
plt.xlabel('Factor Contributing to Happiness')
plt.title('Importance of Factors for Countries with High Happiness Rankings')
plt.show()

From the three graphs, we can see that social support ranks highest in all three categories. Therefore, regardless of a country's current happiness level, social support is most important to its people and is the greatest contributor to happiness across the world.

However, looking at the other categories, there seems to be more variability. GDP, for example, is one of the most variable factors. For countries with high happiness, GDP seems to be almost as important as social support. As we look at countries with lower happiness levels, though, the picture is less clear. Countries with medium happiness still rank GDP quite high, behind social support, but lower than in the high-happiness plot. For countries with low happiness, the importance of GDP is significantly lower, now tied with healthy life expectancy. Thus, we can conclude that happier countries tend to place more importance on GDP.

The other attributes (life expectancy, freedom, generosity, and absence of corruption) also factor into countries' happiness, but rank about the same in all three levels.

Next, we want to show another way to display our data: linear regression! Using Matplotlib, we will graph each factor against happiness. This is a great way to see the general pattern of our data and to compare attributes. A steeper fitted line suggests a larger estimated effect of that factor on happiness (strictly speaking, the slope measures effect size, not correlation strength, which depends on how tightly the points cluster around the line), so we keep an eye out for that when working with linear regression. We are displaying two graphs in order to make things less cluttered and easier to analyze!
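One caveat worth demonstrating before we plot: the slope from np.polyfit and the correlation coefficient are different quantities, and they can disagree. Here is a small sketch on synthetic data (all numbers invented for illustration): two relationships with very different slopes but similarly tight fits.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)

# Two synthetic relationships: slopes differ by 10x, noise scales proportionally
y_steep = 5.0 * x + rng.normal(0, 0.5, 200)
y_flat = 0.5 * x + rng.normal(0, 0.05, 200)

for name, y in [('steep', y_steep), ('flat', y_flat)]:
    m, b = np.polyfit(x, y, 1)      # least-squares line of degree 1
    r = np.corrcoef(x, y)[0, 1]     # Pearson correlation coefficient
    print(f"{name}: slope={m:.2f}, r={r:.2f}")
```

Both series end up about equally correlated with x even though their slopes differ by a factor of ten, which is why reporting the fitted equations in the legend is more informative than eyeballing steepness alone.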

In [18]:
factors = data.iloc[:,:9]
In [19]:
fig, axs = plt.subplots(2,figsize=(15,15))
axs[0].set_xlim([-.25,1.75])
axs[1].set_xlim([-.25,1.75])
axs[0].set_ylabel('Happiness')
axs[1].set_ylabel('Happiness')
axs[0].set_xlabel('Importance')
axs[1].set_xlabel('Importance')
axs[0].set_title('Effect of Social Support, Freedom to Make Choices, Absence of Corruption Importance on Happiness')
axs[1].set_title('Effect of GDP, Healthy Life Expectancy, Generosity Importance on Happiness')

# split the factors between the two subplots to clearly 
# see the different factors better
for i in range(3,9):
    if i % 2 == 0:
        plot = 0
    else:
        plot = 1
    axs[plot].scatter(x = factors.iloc[:,i], y = factors["Happiness Score"], label=data.columns[i])
    m, b = np.polyfit(factors.iloc[:,i], factors["Happiness Score"],1)
    axs[plot].plot(factors.iloc[:,i], m*factors.iloc[:,i] + b, label='y={:.2f}x+{:.2f}'.format(m,b))
    
axs[0].legend()
axs[1].legend()
Out[19]:
<matplotlib.legend.Legend at 0x1a1eeedb00>

In our first graph, we compare the importance of social support, freedom, and absence of corruption. The lines are quite similar, but the legend we included shows that absence of corruption has the steepest slope. In our second graph, it is much clearer that healthy life expectancy importance has the steepest slope compared to GDP and generosity. While the difference was clear in the second plot, the differences in the first plot were harder to see, so including a legend with additional information can be incredibly important for accurate data analysis.

Looking at the graphs together, the importance of absence of corruption leads with the steepest slope. This means that as countries get happier, they tend to place more significance on their lack of corruption, more so than on other factors like GDP, life expectancy, etc. The importance of generosity had the shallowest slope, indicating that generosity is the least closely tied to a country's happiness. However, this certainly does not mean that absence of corruption is the most important, or that generosity is the least important, to a country's happiness score.

We also want to test whether there is a relationship between actual factors, like true GDP and true life expectancy, and their perceived importance. So, does a country's actual GDP affect how much weight the country gives it, and how does that affect happiness? We want to test the same for life expectancy as well.

First, let's organize this data:

In [21]:
exp = data[['Country','Happiness Score','GDP Importance','Actual GDP','Healthy Life Expectancy Importance','Actual Life Expectancy']]
exp = exp.rename(columns={'Healthy Life Expectancy Importance':'Life Expectancy Importance',
                          'Happiness Score':'Happiness'})
exp["Life Expectancy Levels"] = pd.cut(exp["Actual Life Expectancy"], 3, labels=["Low", "Medium", "High"])
exp
Out[21]:
Country Happiness GDP Importance Actual GDP Life Expectancy Importance Actual Life Expectancy Life Expectancy Levels
0 Finland 7.769 1.340 27400.0 0.986 81.908 High
1 Denmark 7.600 1.383 31100.0 0.996 80.898 High
2 Norway 7.554 1.488 37800.0 1.028 82.404 High
3 Iceland 7.494 1.380 30900.0 1.026 82.993 High
4 Netherlands 7.488 1.396 28600.0 0.999 82.283 High
... ... ... ... ... ... ... ...
136 Malawi 3.410 0.191 600.0 0.495 64.263 Low
137 Yemen 3.380 0.287 800.0 0.463 66.125 Medium
138 Rwanda 3.334 0.359 1300.0 0.614 69.024 Medium
139 Tanzania 3.231 0.476 600.0 0.499 65.456 Medium
140 Afghanistan 3.203 0.350 700.0 0.361 64.833 Medium

141 rows × 7 columns

Now, let's graph it!

In [22]:
fig, axs = plt.subplots(2,2,figsize=(15,15))
ind = 0
for factor in ['GDP', 'Life Expectancy']:
    # create our two graphs comparing real life circumstance to its weighted importance
    axs[ind][0].scatter(x = exp['Actual ' + factor], y = exp[factor + ' Importance'])
    axs[ind][0].set_title('Effect of Actual ' + factor + ' on ' + factor + ' Importance')
    axs[ind][0].set_xlabel('Actual ' + factor)
    axs[ind][0].set_ylabel(factor + ' Importance')
    for level in ['High','Medium','Low']:
        # for both GDP and Life Expectancy, graph all three levels of real life
        # circumstance in different colors and label them
        axs[ind][1].scatter(x = exp[exp['Life Expectancy Levels'] == level][factor + ' Importance'],
                        y = exp[exp['Life Expectancy Levels'] == level]['Happiness'], label = level + ' ' + factor)
        axs[ind][1].set_title('Effect of ' + factor + ' Importance on Happiness')
        axs[ind][1].set_ylabel('Happiness')
        axs[ind][1].set_xlabel(factor + ' Importance')
    ind = ind + 1

axs[0][1].legend()
axs[1][1].legend()
Out[22]:
<matplotlib.legend.Legend at 0x1a1f445400>

The plots on the left show how actual GDP and life expectancy affect their importance. The curve of the first graph indicates that some countries with lower GDP do not find it important; however, as GDP increases, they place more value on it. For life expectancy, the clear linear trend shows that as life expectancy increases, it becomes more important.

The right-hand plots show that as actual GDP and life expectancy increase, countries become happier. So, generally, as GDP and life expectancy increase, countries consider them more important, and this contributes to their overall happiness. With life expectancy especially, there seems to be a strict divide in importance depending on whether or not countries have high life expectancies. GDP shows a similar trend, but with blurrier boundaries. However, because there are no data points with combinations such as high life expectancy importance and low actual life expectancy, it is hard to test whether high importance paired with a harsh reality leads to unhappiness, or any similar hypothesis.

Prediction and Machine Learning

For this section, we are using statsmodels:

In [23]:
import statsmodels.formula.api as sm

First, we will prepare a new dataframe whose column names contain no spaces, because the statsmodels formula syntax requires attribute names to be valid Python identifiers. An 'All' column is created for the Mean Squared Error analysis later on, and will be used as the attribute to group all elements by.

In [24]:
model = (data.rename(columns={'Happiness Score':'Score','GDP Importance':'GDPPerCapita', 'Social Support Importance':'SocialSupport','Healthy Life Expectancy Importance':'HealthLifeExpectancy', 
                             'Freedom Importance':'FreedomToMakeLifeDecisions', 'Absence of Corruption Importance':'PerceptionsOfCorruption',
                             'Generosity Importance':'Generosity','Actual GDP':'RealGDP','Actual Life Expectancy':'RealLifeExpectancy'})
            .iloc[:,2:])

# Create an 'All' attribute with the value 1 for every entry
model['All'] = 1

For future reference, here is a clean listing of the column names we now have:

In [25]:
model.columns
Out[25]:
Index(['Score', 'GDPPerCapita', 'SocialSupport', 'HealthLifeExpectancy',
       'FreedomToMakeLifeDecisions', 'Generosity', 'PerceptionsOfCorruption',
       'Region', 'RealGDP', 'RealLifeExpectancy', 'General Happiness Level',
       'All'],
      dtype='object')

Now, to get the multi-dimensional linear regression:

In [26]:
# Use statsmodel to calculate a linear regression
multi_res = sm.ols('Score~1+GDPPerCapita+SocialSupport+HealthLifeExpectancy+FreedomToMakeLifeDecisions+Generosity+PerceptionsOfCorruption', data=model).fit()
multi_res.summary()
Out[26]:
OLS Regression Results
Dep. Variable: Score R-squared: 0.779
Model: OLS Adj. R-squared: 0.769
Method: Least Squares F-statistic: 78.82
Date: Mon, 18 May 2020 Prob (F-statistic): 1.60e-41
Time: 22:19:49 Log-Likelihood: -108.54
No. Observations: 141 AIC: 231.1
Df Residuals: 134 BIC: 251.7
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.5899 0.235 6.768 0.000 1.125 2.055
GDPPerCapita 0.7031 0.230 3.053 0.003 0.248 1.159
SocialSupport 1.2628 0.259 4.881 0.000 0.751 1.775
HealthLifeExpectancy 1.1869 0.353 3.364 0.001 0.489 1.885
FreedomToMakeLifeDecisions 1.3124 0.403 3.256 0.001 0.515 2.110
Generosity 0.8086 0.540 1.496 0.137 -0.260 1.877
PerceptionsOfCorruption 1.0849 0.566 1.918 0.057 -0.034 2.203
Omnibus: 5.843 Durbin-Watson: 1.859
Prob(Omnibus): 0.054 Jarque-Bera (JB): 5.407
Skew: -0.410 Prob(JB): 0.0670
Kurtosis: 3.499 Cond. No. 29.2


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

When analyzing these summary tables, the coefficients of each attribute are a good start for seeing the correlation the model draws between, in our case, Happiness Score and each of these attributes. At a glance, you can see that all of them have positive coefficients, meaning that as the importance of each of these factors increases, the happier a person is expected to be. The next thing we can look at is the p-value: when the p-value is < 0.05, a relationship can be called statistically significant. In this case, it looks like all but Generosity and Perceptions of Corruption have a significant relationship to Happiness Score. If you recall, our earlier scatter plots with separate linear regressions for each attribute also showed that despite consistently low generosity and corruption importance, Happiness Scores had a large range, which makes sense given these findings.

To better understand how well this model predicts Happiness Scores, let's create a prediction for each country to compare with the actual score. While visualizations are usually most helpful, the multi-dimensionality of our data makes such a task hard, so instead we'll evaluate our predictions with mean squared error. To calculate it, we'll compute the squared difference as an attribute and aggregate to get the mean, as shown below.
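In other words, MSE is the average of (actual − predicted)², so a perfect model scores 0 and larger errors are penalized quadratically. A tiny worked sketch, using hypothetical predictions for three countries:

```python
import numpy as np

actual = np.array([7.769, 7.600, 3.203])
predicted = np.array([7.039, 7.156, 3.350])  # made-up model outputs

mse = np.mean((actual - predicted) ** 2)  # average squared residual
print(round(mse, 4))  # 0.2505
```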

In [27]:
# First, create a dataframe consisting of the attributes included in the linear regression
multi = model[['Score','GDPPerCapita','SocialSupport','HealthLifeExpectancy','FreedomToMakeLifeDecisions','Generosity','PerceptionsOfCorruption']]

# Use the regression to predict a happiness score
prediction = multi_res.predict(multi)

# Add predictions and difference squared to our model dataframe
model['Multi Res'] = prediction
model['MR Difference Squared'] = (model['Score'] - model['Multi Res']) ** 2 
model[['Score','Multi Res','MR Difference Squared']].head()
Out[27]:
Score Multi Res MR Difference Squared
0 7.769 7.038702 0.533335
1 7.600 7.156368 0.196810
2 7.554 7.234479 0.102093
3 7.494 7.018640 0.225968
4 7.488 6.993822 0.244212
In [28]:
# Calculate Mean Squared Error
mse = model.groupby('All').agg({'All':'count','MR Difference Squared':'mean'}).rename(columns={'MR Difference Squared':'MR MSE'})
mse
Out[28]:
All MR MSE
All
1 141 0.273001

Now that we've gained more insight into how well the linear regression predicts, let's see if there is some kind of interaction between the real GDP and life expectancy values and their importance when it comes to predicting happiness.
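A note on the formula syntax below: in statsmodels formulas, A*B expands to A + B + A:B, where A:B is the interaction term, the elementwise product of the two columns. This lets the estimated effect of importance vary with the real value. To make the interaction term concrete, here is what that product column looks like on two made-up rows:

```python
import pandas as pd

# Hypothetical values: one low-GDP country, one high-GDP country
df = pd.DataFrame({'GDPPerCapita': [0.25, 1.25], 'RealGDP': [700.0, 27400.0]})

# The A:B interaction column is just the elementwise product
df['GDPPerCapita:RealGDP'] = df['GDPPerCapita'] * df['RealGDP']
print(df['GDPPerCapita:RealGDP'].tolist())  # [175.0, 34250.0]
```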

In [29]:
interact_res = sm.ols('Score~GDPPerCapita*RealGDP+HealthLifeExpectancy*RealLifeExpectancy',data=model).fit()
interact_res.summary()
Out[29]:
OLS Regression Results
Dep. Variable: Score R-squared: 0.722
Model: OLS Adj. R-squared: 0.710
Method: Least Squares F-statistic: 58.05
Date: Mon, 18 May 2020 Prob (F-statistic): 6.70e-35
Time: 22:19:52 Log-Likelihood: -124.74
No. Observations: 141 AIC: 263.5
Df Residuals: 134 BIC: 284.1
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.0756 2.095 4.333 0.000 4.933 13.218
GDPPerCapita 0.9549 0.284 3.363 0.001 0.393 1.516
RealGDP 5.431e-05 5.49e-05 0.990 0.324 -5.42e-05 0.000
GDPPerCapita:RealGDP -1.94e-05 3.42e-05 -0.568 0.571 -8.7e-05 4.82e-05
HealthLifeExpectancy 0.6046 2.772 0.218 0.828 -4.878 6.087
RealLifeExpectancy -0.1067 0.037 -2.852 0.005 -0.181 -0.033
HealthLifeExpectancy:RealLifeExpectancy 0.0463 0.037 1.264 0.208 -0.026 0.119
Omnibus: 4.125 Durbin-Watson: 1.561
Prob(Omnibus): 0.127 Jarque-Bera (JB): 4.133
Skew: -0.413 Prob(JB): 0.127
Kurtosis: 2.856 Cond. No. 1.43e+06


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.43e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Unfortunately, it seems that only GDP importance and real life expectancy have a significant relationship with happiness under this model, with very high p-values for both interaction terms as well as several standalone attributes. Sadly, this means we cannot make any claims about the interaction between real-life circumstances and perceived importance when it comes to happiness. Still, learning that there is no detectable relationship is learning something, and it remains an important kind of conclusion that can be drawn through machine learning! To compare this model's accuracy against our first in the same metric, we'll repeat the Mean Squared Error analysis.
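
Rather than reading significance off the summary table by eye, the significant terms can be pulled straight from the fit's `pvalues` attribute. A minimal sketch on hypothetical toy data (the variables `x`, `z`, and `y` are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: y depends on x but not on z
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=50), 'z': rng.normal(size=50)})
toy['y'] = 2 * toy['x'] + rng.normal(scale=0.1, size=50)

res = smf.ols('y ~ x + z', data=toy).fit()

# List the terms significant at the 0.05 level
significant = res.pvalues[res.pvalues < 0.05].index.tolist()
print(significant)  # 'x' will certainly appear; other terms only by chance
```

The same filter applied to `interact_res.pvalues` would confirm the reading above programmatically.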

In [30]:
multi = model[['Score','GDPPerCapita','RealGDP','HealthLifeExpectancy','RealLifeExpectancy']]
prediction = interact_res.predict(multi)
model['Interact Res'] = prediction
model['IR Difference Squared'] = (model['Score'] - model['Interact Res']) ** 2 
model[['Score','Multi Res','MR Difference Squared','Interact Res','IR Difference Squared']].head()
Out[30]:
Score Multi Res MR Difference Squared Interact Res IR Difference Squared
0 7.769 7.038702 0.533335 6.728225 1.083213
1 7.600 7.156368 0.196810 6.953223 0.418321
2 7.554 7.234479 0.102093 7.210752 0.117819
3 7.494 7.018640 0.225968 6.953117 0.292554
4 7.488 6.993822 0.244212 6.819074 0.447463
In [31]:
mse = model.groupby('All').agg({'All':'count','IR Difference Squared':'mean'}).rename(columns={'IR Difference Squared':'IR MSE'})
mse
Out[31]:
All IR MSE
All
1 141 0.343529

It looks as if, on average, this interaction model does worse at predicting than our first model. However, before picking the better of the two, we'd like to try one last model.

But before we move on, let's take a quick break from our regular happiness-model programming to fit some one-dimensional models testing the relationships we saw between actual GDP and life expectancy and their importance to people. Since these models are one-dimensional, we can also visualize the regressions!

In [32]:
GDP_res = sm.ols('GDPPerCapita~RealGDP', data=model).fit()
GDP_res.summary()
Out[32]:
OLS Regression Results
Dep. Variable: GDPPerCapita R-squared: 0.627
Model: OLS Adj. R-squared: 0.625
Method: Least Squares F-statistic: 234.0
Date: Mon, 18 May 2020 Prob (F-statistic): 1.37e-31
Time: 22:19:55 Log-Likelihood: 1.3895
No. Observations: 141 AIC: 1.221
Df Residuals: 139 BIC: 7.119
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.6342 0.028 22.849 0.000 0.579 0.689
RealGDP 2.926e-05 1.91e-06 15.297 0.000 2.55e-05 3.3e-05
Omnibus: 8.521 Durbin-Watson: 1.366
Prob(Omnibus): 0.014 Jarque-Bera (JB): 9.070
Skew: -0.610 Prob(JB): 0.0107
Kurtosis: 2.760 Cond. No. 1.98e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.98e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [33]:
le_res = sm.ols('HealthLifeExpectancy~RealLifeExpectancy', data=model).fit()
le_res.summary()
Out[33]:
OLS Regression Results
Dep. Variable: HealthLifeExpectancy R-squared: 0.949
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 2598.
Date: Mon, 18 May 2020 Prob (F-statistic): 7.72e-92
Time: 22:19:56 Log-Likelihood: 214.97
No. Observations: 141 AIC: -425.9
Df Residuals: 139 BIC: -420.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1.4854 0.044 -33.895 0.000 -1.572 -1.399
RealLifeExpectancy 0.0303 0.001 50.972 0.000 0.029 0.031
Omnibus: 121.244 Durbin-Watson: 1.925
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1657.654
Skew: -2.958 Prob(JB): 0.00
Kurtosis: 18.721 Cond. No. 723.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Looks like we have small p-values and significant relationships for both the life expectancy and GDP pairs. Now let's visualize our regressions using predictions:

In [34]:
le = pd.DataFrame({'RealLifeExpectancy': [model.RealLifeExpectancy.min(), model.RealLifeExpectancy.max()]})
preds = le_res.predict(le)
# first, plot the observed data
model.plot(kind='scatter', x='RealLifeExpectancy', y='HealthLifeExpectancy')

# then, plot the least squares line
plt.plot(le, preds, c='red', linewidth=2)
plt.title('Effect of Real Life Expectancy on its Importance')
Out[34]:
Text(0.5, 1.0, 'Effect of Real Life Expectancy on its Importance')
In [35]:
gdp = pd.DataFrame({'RealGDP': [model.RealGDP.min(), model.RealGDP.max()]})
preds = GDP_res.predict(gdp)
# first, plot the observed data
model.plot(kind='scatter', x='RealGDP', y='GDPPerCapita')

# then, plot the least squares line
plt.plot(gdp, preds, c='red', linewidth=2)
plt.title('Effect of Real GDP on its Importance')
Out[35]:
Text(0.5, 1.0, 'Effect of Real GDP on its Importance')

The linear regression looks to match life expectancy pretty well, while GDP is a bit more of a stretch. So, before we return to our main happiness models, take this as a lesson to use visualization and further analysis whenever possible: other metrics, such as our model's p-values, may not give the full picture. Beyond visualization, a Mean Squared Error analysis would also have revealed a less-than-perfect regression, but we'll leave that to you to try and calculate! Does the GDP regression perform significantly worse than the life expectancy regression?
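
If you do try that exercise, the squared-difference pattern from earlier generalizes to any fitted model; here is a minimal helper sketched on hypothetical toy data (in the notebook you would pass `GDP_res` or `le_res` together with `model` and the relevant target column):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def model_mse(fit, data, target):
    """Mean squared error of a fitted statsmodels result on `data`."""
    pred = fit.predict(data)
    return ((data[target] - pred) ** 2).mean()

# Hypothetical toy data standing in for the real columns
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': rng.normal(size=40)})
toy['y'] = 1.5 * toy['x'] + rng.normal(scale=0.5, size=40)

fit = smf.ols('y ~ x', data=toy).fit()
print(model_mse(fit, toy, 'y'))  # small, on the order of the noise variance
```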

Now, back to happiness models. The final thing we'd like to try is one last multi-dimensional linear regression using only the attributes that got a p-value less than 0.05 in our first model.
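
Those attributes can also be selected programmatically instead of copied by hand from the summary table; a sketch on hypothetical toy data (in the notebook, `multi_res` would play the role of `full`):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy fit: y depends on `a` only
rng = np.random.default_rng(2)
toy = pd.DataFrame({'a': rng.normal(size=60), 'b': rng.normal(size=60)})
toy['y'] = toy['a'] + rng.normal(scale=0.2, size=60)

full = smf.ols('y ~ a + b', data=toy).fit()

# Keep the predictors (not the intercept) with a p-value below 0.05,
# then rebuild the formula string from them
keep = [t for t in full.pvalues.index
        if t != 'Intercept' and full.pvalues[t] < 0.05]
formula = 'y ~ 1 + ' + ' + '.join(keep)
print(formula)
```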

In [36]:
selective_res = sm.ols('Score~1+GDPPerCapita+SocialSupport+HealthLifeExpectancy+FreedomToMakeLifeDecisions',data=model).fit()
selective_res.summary()
Out[36]:
OLS Regression Results
Dep. Variable: Score R-squared: 0.766
Model: OLS Adj. R-squared: 0.759
Method: Least Squares F-statistic: 111.0
Date: Mon, 18 May 2020 Prob (F-statistic): 7.79e-42
Time: 22:19:59 Log-Likelihood: -112.78
No. Observations: 141 AIC: 235.6
Df Residuals: 136 BIC: 250.3
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.7325 0.227 7.643 0.000 1.284 2.181
GDPPerCapita 0.7474 0.230 3.245 0.001 0.292 1.203
SocialSupport 1.1205 0.259 4.330 0.000 0.609 1.632
HealthLifeExpectancy 1.2636 0.360 3.512 0.001 0.552 1.975
FreedomToMakeLifeDecisions 1.8239 0.370 4.929 0.000 1.092 2.556
Omnibus: 3.293 Durbin-Watson: 1.859
Prob(Omnibus): 0.193 Jarque-Bera (JB): 2.896
Skew: -0.344 Prob(JB): 0.235
Kurtosis: 3.142 Cond. No. 18.4


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

All the p-values are less than 0.05! But let's do the same Mean Squared Error analysis to see how the predictions line up.

In [37]:
select = model[['Score','GDPPerCapita','SocialSupport','HealthLifeExpectancy','FreedomToMakeLifeDecisions']]
prediction = selective_res.predict(select)
model['Selective Res'] = prediction
model['SR Difference Squared'] = (model['Score'] - model['Selective Res']) ** 2 
model[['Score','Multi Res','MR Difference Squared','Interact Res','IR Difference Squared','Selective Res','SR Difference Squared']].head()
Out[37]:
Score Multi Res MR Difference Squared Interact Res IR Difference Squared Selective Res SR Difference Squared
0 7.769 7.038702 0.533335 6.728225 1.083213 6.845277 0.853265
1 7.600 7.156368 0.196810 6.953223 0.418321 6.867070 0.537186
2 7.554 7.234479 0.102093 7.210752 0.117819 7.016134 0.289299
3 7.494 7.018640 0.225968 6.953117 0.292554 6.958057 0.287235
4 7.488 6.993822 0.244212 6.819074 0.447463 6.759595 0.530574
In [38]:
mse = model.groupby('All').agg({'All':'count','SR Difference Squared':'mean'}).rename(columns={'SR Difference Squared':'SR MSE'})
mse
Out[38]:
All SR MSE
All
1 141 0.289919

It looks neck and neck, so the final thing we'll do is see how the distributions of error for each model compare, using a box plot visualization!

In [39]:
val = [model['MR Difference Squared'].values.tolist()
       , model['IR Difference Squared'].values.tolist()
       , model['SR Difference Squared'].values.tolist()]
plt.boxplot(val, patch_artist=True)
plt.xticks([1, 2, 3], ['Multi-Res','Interact-Res','Selective-Res'], rotation=45)
plt.xlabel('Linear Model')
plt.ylabel('Difference Squared')
plt.title('Difference Squared Distributions across Different Linear Models')
plt.show()

Although they look fairly similar, Multi-Res not only sits lower overall, but its distribution is also more concentrated at small squared differences, with smaller outliers. Based on this, we'll use our Multi-Res model to draw our final conclusions and wrap up this tutorial!
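
As a compact numeric companion to the box plot, the three MSEs can be compared in a single line by taking the mean of each squared-difference column; a sketch on hypothetical toy values (in the notebook, these columns live in `model`):

```python
import pandas as pd

# Hypothetical squared-error columns standing in for those in `model`
toy = pd.DataFrame({
    'MR Difference Squared': [0.5, 0.2, 0.1],
    'IR Difference Squared': [1.0, 0.4, 0.1],
    'SR Difference Squared': [0.8, 0.5, 0.3],
})

# One column mean per model: the lowest value marks the best predictor
print(toy.mean())
```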

Summary/Conclusion

From all this, we've seen that how we weigh the importance of factors such as GDP, life expectancy, social support, and freedom has a big impact on how happy we are. And although we can't say whether the interaction of circumstance and perception affects happiness, or how to optimize such a relationship, we do know that better circumstances lead people to give those factors more importance, which in turn is associated with higher happiness scores. Beyond these happiness insights, we hope we've left you with all the tools you need to generate some of your own! From data cleaning to exploratory analysis to prediction and machine learning, we hope you've learned how to build your own pipeline and use the libraries introduced along the way. And just as data science has endless possibilities, we hope you don't stop here: open yourself up to different ways of visualizing data, such as going beyond 2D representations, and to other machine learning techniques, such as decision trees and random forests. As you continue your explorations, we hope this was a helpful start!

If you want to learn even more about data science, feel free to check out these resources: